Abstract —Benchmarking digital watermarking algorithms is
not an easy task because different applications of digital
watermarking often have very different sets of requirements and
trade-offs between conflicting requirements. While some
general-purpose digital watermarking benchmarking systems are
available, they normally do not support complicated
benchmarking tasks and cannot be easily reconfigured to work
with different watermarking algorithms and testing conditions. In
this paper, we propose OR-Benchmark, an open and highly
reconfigurable general-purpose digital watermarking benchmarking
framework, which has the following two key features: 1) all
the interfaces are public and general enough to support all
watermarking applications and benchmarking tasks we can think
of; 2) end users can easily extend the functionalities and freely
configure what watermarking algorithms are tested, what system
components are used, how the benchmarking process runs, and
what results should be produced. We implemented a prototype of
this framework as a MATLAB software package and used it to
benchmark a number of digital watermarking algorithms
involving two types of watermarks for content authentication and
self-restoration purposes. The benchmarking results demonstrated
the advantages of the proposed benchmarking framework, and
also gave us some useful insights about existing image
authentication and self-restoration watermarking algorithms, which
is an important but less studied topic in digital watermarking.

Index Terms —Digital Watermarking, Benchmarking,
Performance Evaluation, Reconfigurability, Content Authentication,
Self-restoration


I. Introduction 

Digital watermarking, a branch of information hiding,
involves research on the process of embedding digital
information (a watermark) within a cover signal to achieve
different (often security-related) functionalities related to the
cover signal and/or its consumption by end users [1]. Since the
late 1980s, a large number of digital watermarking algorithms
have been proposed for many applications with different
system requirements, mostly for protecting different types of
multimedia data such as still images, audio, video and 3-D models.
For instance, due to the convenience of transmitting
and storing digital multimedia data on the Internet,
copyright protection of digital multimedia content has become one
main application of digital watermarking. In this application,
robust watermarking schemes [?], [?]-[?] are desired to
embed copyright information in the digital media as a watermark
that is hard to remove. Other applications of digital
watermarking include content authentication, transaction tracking,
usage control, self-restoration, broadcast monitoring, etc. In
some multimedia content authentication applications, fragile
watermarking schemes are desired because of the need to
capture any change to the content, which is often achieved via
the fragility of the embedded digital watermarks [?]-[14]. In some
other multimedia content authentication applications, however,
semi-fragile watermarking schemes [15]-[17] are desired to
tolerate benign signal processing operations on watermarked
multimedia data while malicious alterations should still be
detectable, which is important because today’s multimedia systems
involve complicated processes between the sender and the
receiver of multimedia content.

There are a number of properties associated with a digital
watermarking algorithm, depending on different application
requirements. It is well accepted that imperceptibility and
robustness are the two most important but normally conflicting
requirements. Besides, embedding capacity/efficiency, security
(i.e. the ability to resist malicious attacks) and computational
complexity are also important properties for most digital
watermarking systems. However, the importance of each
property differs across applications. Some properties
also overlap with each other, e.g. security is often linked to
robustness against malicious signal processing (attacks). For
instance, in copyright protection applications, the requirement
on robustness is critical as the digital watermark needs to
survive both benign signal processing and malicious attacks;
in content authentication applications, however, fragility (i.e.
lack of robustness) is required to detect malicious content
manipulations. There are also some additional
application-oriented properties, e.g. reliability (normally measured using
decoding error rates or correlation of the decoded watermark with
the original watermark) in copyright protection applications,
accuracy (normally measured using false positive and false
negative rates) in content authentication applications, and
perceptual quality of the recovered cover in self-restoration
applications.

As in many other multimedia systems, a general-purpose,
flexible and fair benchmarking environment with appropriate
test criteria is of particular importance for performance
evaluation and comparison of digital watermarking algorithms.
In the literature, most researchers compare their digital
watermarking algorithms with competing ones by looking at
a number of selected testing criteria for one or more target
applications. However, because of different testing conditions
and the lack of details of the experimental setups, it is hardly
possible to rely on published results for performance
comparison. Thus, it is often necessary to repeat the performance
evaluation of previous algorithms under the same
testing conditions to have a fair comparison with a newly proposed
algorithm, which calls for a general-purpose
benchmarking system that can facilitate the performance
evaluation/comparison process and maximize the reuse of previous
results. With a properly designed benchmarking system, end
users and researchers can evaluate the performance
of a given algorithm and compare the performance of multiple
algorithms more easily and fairly, to learn more about the pros and
cons of different algorithms and to gain more insights into
how to further improve existing algorithms. Since the 1990s,
a number of digital watermarking benchmarking systems have
been proposed [18]-[?].

Generally speaking, benchmarking the performance of digital
watermarking algorithms is not an easy task because different
digital watermarking applications often have very different
sets of requirements and trade-offs among conflicting
requirements. When multiple digital watermarking algorithms with
changeable parameters have to be evaluated against each other,
the benchmarking task becomes more complicated.
Furthermore, for systems involving more than one type of watermark,
e.g. content authentication watermarking with the capability
of self-restoration, the complexity of the benchmarking task
becomes even higher. While some general-purpose
digital watermarking benchmarking systems are available,
most of them can be applied to only certain digital
watermarking systems for a limited range of applications.
In addition, existing benchmarking systems normally do not
support complicated benchmarking tasks and cannot be easily
reconfigured to work with different algorithms and testing
conditions. It is thus still a challenge to design an efficient
and general-purpose benchmarking system that can be used to
benchmark different digital watermarking algorithms.

In this paper, we propose OR-Benchmark, an open and
highly reconfigurable general-purpose framework for
benchmarking digital watermarking algorithms, which is designed
to meet the needs of different digital watermarking algorithms
and various benchmarking tasks. Its main features include:

• The framework has open interfaces for (re)configuring
different parts of the benchmarking system and for adding
new modules.¹ We plan to release the implemented
prototype of the framework as an open-source tool.

• The framework defines a unified procedure for
benchmarking different digital watermarking algorithms against
different attacks and using different performance indicators
to make the comparison fairer and more systematic.

• The framework is designed to be independent of the
media type, so it can be applied to digital watermarking
algorithms for different media types, although in this
paper we only demonstrate it for image watermarking.

We implemented a prototype of the proposed
OR-Benchmark framework as a MATLAB software package. To
demonstrate how the framework can be used to benchmark
digital watermarking systems, we used the implemented
prototype system to benchmark three recently proposed semi-fragile
image watermarking algorithms for content authentication
and self-restoration. Those benchmarked digital watermarking
systems use two types of watermarks (one for content
authentication and the other for self-restoration), so they are among the
most complicated digital watermarking systems one may need
to benchmark. The results on one hand demonstrated the advantages
of the proposed framework, and on the other hand led to some
insights about how to better compare the performance of such
complicated digital watermarking systems and further improve
their performance.

¹ Adding new source code is unavoidable for new functionalities, so our
focus is on how easily one can add one’s own source code to extend the
framework’s functionalities.

The rest of the paper is organized as follows. In Section II,
related work on digital watermarking benchmarking is
introduced. Section III gives a detailed description of our proposed
benchmarking framework, including our abstract modelling of
digital watermarking systems, important evaluation criteria,
the proposed OR-Benchmark framework, and a comparison with
other existing digital watermarking benchmarking systems.
Next, in Section IV we describe how we implemented a first
prototype of OR-Benchmark in MATLAB, and the results of using
the implemented prototype system to benchmark a number of
digital image watermarking systems for content authentication
and self-restoration. In Section V we discuss some subtle
aspects of digital watermarking benchmarking and how
we currently handle them in OR-Benchmark. The paper is
concluded by Section VI with future work.


II. Related Work 

Benchmarking of digital watermarking algorithms is the
process of evaluating and comparing their performance in
a fair and normally (semi-)automated environment. While
a substantial number of digital watermarking
algorithms have been proposed for different applications and usage
scenarios, there is relatively little research on digital
watermarking benchmarking, especially on general-purpose frameworks
capable of handling multiple applications with different sets
of requirements. Most existing digital watermarking
benchmarking systems focus on some well-defined sub-areas, among
which image watermarking has received the most attention. In this
section, we briefly overview some representative work.


A. StirMark 

StirMark, one of the earliest and most well-known
digital watermarking benchmarking systems, was first
proposed by Petitcolas et al. in 1998 [18] as a generic tool for
benchmarking digital image watermarking algorithms against
various attacks. More researchers contributed to it
in 2001 [?], turning it into a more general framework
for benchmarking digital watermarking algorithms.
Subsequently, several enhanced versions of StirMark were developed
to include more attacks and cover audio watermarking [25],
[26]. The main aim of StirMark is to develop a fully
automated evaluation service, which could encapsulate different
performance evaluation indicators and allow continuous
development of new attacks to be integrated into the whole system.
Since StirMark is among the benchmarking systems most widely
used by the digital watermarking community, we discuss
it in greater detail below.

Fig. 1: The architecture of StirMark for watermarking evaluation [24].


1) Interfaces: To use StirMark for benchmarking a given
digital watermarking algorithm, the user is required to
supply three functions: the Embed and Extract functions, and a
GetSchemeInfo function which provides meta-information
about the algorithm such as the name, version, author(s),
the maximum byte-length of the embedded message, the
maximum bit-length of the stego-key, etc.

In order to support different use cases and digital
watermarking algorithms, several parameters are provided,
including some mandatory parameters such as the strength for
embedding/extraction, the key for embedding/extraction, the
watermark to be embedded and extracted, and some optional
parameters such as the maximum distortion tolerated and the
certainty of extraction (i.e. a number between 0 and 100
indicating the probability of an embedded watermark being
correctly detected). All the parameters are used to support
various types of algorithms, but users cannot easily add new
parameters without changing the source code of StirMark.

Since different watermarking algorithms have different
evaluation requirements, StirMark divides watermarking
algorithms into six categories according to blindness and the
output of the Extract function. According to the algorithm
type, StirMark defines different sets of input and output
arguments for watermark embedding and extracting functions,
and different sets of tests listed in the evaluation profiles.

2) Evaluation Criteria: The main performance indicators 
of a digital watermarking algorithm StirMark can evaluate 
include imperceptibility, capacity, robustness to attacks, false 
alarm rate and execution speed as discussed below. 

Imperceptibility is evaluated as the perceptual quality
distortion introduced to the cover signal by the watermark
embedding process. StirMark uses PSNR as the default perceptual
visual quality assessment (PQA) metric and in principle allows
the use of other PQA metrics. However, adding other PQA
metrics requires modifying the source code of the StirMark
implementation related to imperceptibility evaluation, such as
the “robustness vs. visual quality” test.
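For reference, PSNR between a cover image and its watermarked version can be computed as in the following minimal sketch. It is not StirMark's code: the function name `psnr`, the list-of-rows image representation and the default peak value of 255 (8-bit grayscale) are assumptions made only for illustration.

```python
import math


def psnr(cover, marked, peak=255.0):
    """Peak signal-to-noise ratio (in dB) between two grayscale images.

    Both images are lists of rows of pixel values with identical
    dimensions; `peak` is the largest possible pixel value.
    """
    if len(cover) != len(marked) or any(
        len(a) != len(b) for a, b in zip(cover, marked)
    ):
        raise ValueError("images must have the same dimensions")
    n_pixels = sum(len(row) for row in cover)
    if n_pixels == 0:
        raise ValueError("images must not be empty")
    squared_error = sum(
        (a - b) ** 2
        for row_c, row_m in zip(cover, marked)
        for a, b in zip(row_c, row_m)
    )
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher value means less embedding distortion; swapping in another PQA metric only requires a function with the same two-image signature, which is the kind of extension point the text argues should not require changes to the core source code.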

Normally the embedding capacity is a fixed constraint, so
StirMark does not directly measure it but uses it to inform
the robustness testing process, where the watermark has a
random payload of a given size. StirMark provides tools
to analyze the relation between capacity and robustness.

Regarding robustness to attacks, the StirMark implementation
models attacks as C++ classes and allows the addition of new
attacks as new classes to test.

The false alarm rate, also known as the “false positive rate”,
covers two cases: 1) the detector reports a mark in a
signal without a mark; 2) the detector reports a mark w' in
a signal marked with w ≠ w'. In StirMark, the first case is
evaluated by taking some randomly selected signals without
any watermark and sending them to the detector to see if the
detector (wrongly) reports a watermark, and the second case is
evaluated by taking some marked signals and running them through
the detector to see if a wrong watermark is detected.
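The two cases can be estimated empirically as in the sketch below. This is an illustration under stated assumptions rather than StirMark's implementation: `detect(signal, watermark)` is a hypothetical detector callback returning a Boolean, and the function and argument names are invented for this example.

```python
def false_alarm_rates(detect, unmarked_signals, marked_signals, probe_marks):
    """Estimate the false-alarm rate for the two cases described above.

    detect(signal, watermark) -> bool is the detector under test.
    unmarked_signals: signals that carry no watermark (case 1).
    marked_signals: (signal, embedded_watermark) pairs (case 2).
    probe_marks: candidate watermarks presented to the detector.
    Returns (case1_rate, case2_rate); a rate is NaN if no trial was run.
    """
    case1_trials = case1_alarms = 0
    for signal in unmarked_signals:
        for mark in probe_marks:
            case1_trials += 1
            case1_alarms += bool(detect(signal, mark))

    case2_trials = case2_alarms = 0
    for signal, embedded in marked_signals:
        for mark in probe_marks:
            if mark == embedded:
                continue  # a correct detection is not a false alarm
            case2_trials += 1
            case2_alarms += bool(detect(signal, mark))

    def rate(alarms, trials):
        return alarms / trials if trials else float("nan")

    return rate(case1_alarms, case1_trials), rate(case2_alarms, case2_trials)
```

For example, with a toy detector that tests substring membership, `false_alarm_rates(lambda s, w: w in s, ["aaaa", "bbXb"], [("ccXc", "X"), ("dYXd", "Y")], ["X", "Y"])` counts one alarm in four case-1 trials and one alarm in two case-2 trials.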

In StirMark, the execution speed is evaluated by
computing the average CPU time spent on the watermark
embedding/extraction processes for a given signal of a particular size
and on a particular platform.
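A minimal sketch of this kind of measurement is given below. It assumes the embedding or extraction routine is available as an ordinary Python callable; the helper name `average_cpu_time` and the default number of repetitions are illustrative choices, not part of StirMark.

```python
import time


def average_cpu_time(func, args=(), repeats=5):
    """Average CPU time (in seconds) of calling func(*args) `repeats` times.

    time.process_time() counts CPU time of the current process only, so
    the result is less sensitive to other system load than wall-clock time.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    total = 0.0
    for _ in range(repeats):
        start = time.process_time()
        func(*args)
        total += time.process_time() - start
    return total / repeats
```

As the text notes, the figure is only meaningful together with the signal size and the platform on which it was measured, so both should be recorded alongside the result.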

3) Benchmarking Framework: Figure 1 shows the
architecture of StirMark as a watermarking evaluation framework.
There are six main components in the framework: the
marking scheme library, test library, evaluation profile
library, quality metrics library, multimedia database and results
database. According to different application requirements,
there are different evaluation profiles, each of which is
composed of a list of tests or attacks to be applied and a list of
multimedia signals required for the test. The end user is
required to add the watermarking algorithm under testing (in the
form of three C++ functions: GetSchemeInfo, Embed
and Extract) to the marking scheme library. In the GetSchemeInfo
function, the end user also selects which evaluation profile
will be used. The evaluation profiles are written as INI
files with a pre-defined static structure, so although users
can define their own profiles they are limited to the static
structure. Extending the structure of the evaluation profiles
requires changes to the StirMark implementation’s source
code. According to the information provided by the end user,
StirMark runs the defined benchmarking process automatically
by using its multimedia database, the tests (attacks) library and
the quality metrics library. The results are stored in a database
(an SQL server as stated in [24], and simple files in actual
implementations).
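To make the idea of an INI-based evaluation profile concrete, the sketch below parses a small profile listing tests with their parameters and the media they run on. The section names, keys and file names are hypothetical and do not reproduce StirMark's actual profile format; they only illustrate why a reader that expects a fixed set of sections cannot accommodate new kinds of entries without code changes.

```python
import configparser

# Hypothetical profile: every section name, key and file name is invented.
PROFILE_TEXT = """
[Profile]
name = demo_profile

[Tests]
jpeg = 90 70 50
rotation = 1 2 5

[Media]
images = image_a.png image_b.png
"""


def load_profile(text):
    """Parse a profile with the three fixed sections shown above."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    tests = {
        name: [float(value) for value in values.split()]
        for name, values in parser["Tests"].items()
    }
    return {
        "name": parser["Profile"]["name"],
        "tests": tests,
        "media": parser["Media"]["images"].split(),
    }


profile = load_profile(PROFILE_TEXT)
```

Because `load_profile` hard-codes the three sections it reads, a profile that needs, say, per-test quality metrics would require editing the loader itself, which mirrors the limitation described above.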

StirMark is designed to achieve simplicity (to conduct tests
and record results automatically) and customization (to let the
end user choose different evaluation profiles). However, the
boundaries among the watermarking library, evaluation profiles,
test library and quality metrics are unclear. For instance, the
test library is associated with not only the evaluation profile, but
also the quality metrics and some information about parameter
settings in the watermarking scheme library. Although the authors
of [24] mentioned that StirMark allows the addition and use of
new tests and quality metrics, it is not easy to do so as
the interfaces among different parts of the framework are not
all clearly defined and manual changes to the core StirMark source
code (in C++) are always required. Furthermore, there are only
a limited number of evaluation profiles in the current StirMark
implementation, which can only be used to benchmark some
limited types of digital watermarking schemes.

4) Implementation: StirMark was originally developed by
Markus Kuhn in 1997 [27] as a generic software tool for
simple robustness testing of image watermarking algorithms.
It simulates many common attacks on image watermarking
algorithms, including random bilinear geometric distortions
to de-synchronize watermarking algorithms. In [?] it was
suggested that image watermarking algorithms which do not
survive StirMark should be considered unacceptably insecure.

Subsequently, further development of StirMark was taken
over by Fabien Petitcolas, and it was incrementally improved
by Petitcolas and some other researchers for more digital
watermarking applications to become a “fair” benchmarking
system with a longer list of tests and attacks, with the release
of version 3.1 in 1999 [28]. Later on, some more
development work took place, including a set of tests for audio
watermarking developed by Steinebach et al. [25] and by Lang
and Dittmann [26]. There were also efforts by Petitcolas et al. [24]
to make StirMark a public automated web-based evaluation
service, which led to version 4.0 of StirMark
[29]. The StirMark implementation was written in C++,
and it has some level of reconfigurability in the form of an INI file
where the end user can define an evaluation profile that
lists all the tests with relevant parameters and all the multimedia
objects required for the tests.

5) Limitations: Although StirMark has been widely used as
a tool for robustness and security evaluation of digital
watermarking algorithms, we feel it has the following limitations.

The modelling and interface for digital watermarking
algorithms do not cover all applications. For instance, there are
only two types of output for watermark detection (i.e. the
Extract function): the extracted watermark and a certainty
value giving the probability that the watermark is detected
correctly. This is obviously not sufficient to support digital
watermarking algorithms for tamper localization or image
restoration purposes.

StirMark is reconfigurable but the level of reconfigurability
is limited. Reconfiguring StirMark for a digital watermarking
algorithm can be done by defining the input and output
arguments according to one of the six pre-defined types of
algorithms, but adding new parameters and extending existing
parameter settings require changing the source code
of the StirMark implementation (in C++). For example, the
strength parameter in StirMark is set to be a single
floating-point number with many hard constraints (e.g. minimum and
maximum values are linked to specific PSNR values), but for
some digital watermarking algorithms the strength could be a more
complicated parameter, such as a vector of two or
more numbers controlling different parts of the watermark
embedding process, e.g. the size of a single watermark and
the number of duplicate watermarks embedded.

Although StirMark allows adding new tests, attacks and
PQA metrics, the unclear boundaries among components make
it hard to do so without making changes to the source code of
the StirMark implementation. Adding a new test, attack
or quality metric may require a re-design of the framework,
e.g. if a non-PSNR PQA metric is introduced, the strength
parameter will need to be re-defined and many existing components
will need to be adapted to the new PQA metric.

The StirMark framework defined in [24] and shown in Fig. 1
does not follow a clear data flow, e.g. the test library does not
really flow into the evaluation profile but reads data from the
latter and the multimedia database.

In [24], StirMark is described as working with an SQL server
to store all the evaluation results, which can then be converted
into web pages for reporting. However, the SQL-based web
service has not actually been implemented. Instead, the latest
C++ implementation of StirMark [29] produces a plain data
sheet to store the evaluation results, which cannot be easily
converted into other formats or used for further analysis.


B. Other Benchmarking Systems 

Checkmark was developed by Pereira et al. [?]
and was downloadable from
http://cvml.unige.ch/ResearchProjects/Watermarking/Checkmark/
(now discontinued). Checkmark
was based on StirMark with the following main changes. First
of all, a number of new attacks, which take into account
statistical properties of images and watermarks, are incorporated
into Checkmark. Detailed descriptions of most attacks are
provided in [21], [?], [?]. Secondly, weighted PSNR and
Watson’s metric are used as new metrics for evaluating image
quality instead of just PSNR. Thirdly, evaluation results are
represented in a flexible XML format and can be automatically
converted into HTML web pages. To use Checkmark, users
need to supply some original images and their watermarked
versions, and then customize two initial functions (one is used
to inform Checkmark about the input images, and the other is
used to define the watermark detector, which should return a
binary output indicating the result of the watermark detection
process). Despite the changes to StirMark, the
reconfigurability of Checkmark remains relatively low, so users normally
have to make changes to Checkmark’s source code.

Optimark [20] is a benchmarking software package for
image watermarking algorithms downloadable from [?],
providing a graphical user interface (GUI) developed using C/C++.
To use Optimark for benchmarking a digital watermarking
algorithm, users can choose a set of test images, define
different watermark embedding keys and watermark messages
for multiple trials of the watermarking detector and decoder,
and select a set of attacks among 14 types of attacks and
attack combinations. It allows evaluation of several statistical
characteristics of an image watermarking algorithm, including
Receiver Operating Characteristic (ROC) curves with false
positive and false negative rates as watermark detection
performance metrics. Optimark also supports combining multiple
ROC curves to measure the overall performance by allowing
users to set weights for a number of selected input images and
attacks.
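As a brief illustration of how an ROC curve based on false positive and false negative rates can be obtained, the sketch below sweeps a decision threshold over detector scores. It is a generic construction and not Optimark's code: the function name, the score-based detector model (a watermark is declared present when the score is at least the threshold) and the input lists are assumptions.

```python
def roc_points(scores_marked, scores_unmarked):
    """Return (threshold, fp_rate, fn_rate) triples for a score-based detector.

    scores_marked: detector scores for signals that carry the watermark.
    scores_unmarked: detector scores for signals that do not.
    A watermark is reported when score >= threshold.
    """
    if not scores_marked or not scores_unmarked:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(scores_marked) | set(scores_unmarked))
    points = []
    for threshold in thresholds:
        # Missed detections among marked signals.
        fn = sum(s < threshold for s in scores_marked) / len(scores_marked)
        # Spurious detections among unmarked signals.
        fp = sum(s >= threshold for s in scores_unmarked) / len(scores_unmarked)
        points.append((threshold, fp, fn))
    return points
```

Each triple is one operating point of the detector; repeating the computation per test image or per attack yields the individual curves that a weighted combination would then summarize.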

Certimark is the outcome of an EU-funded research project
(http://www.certimark.org/, lasting from 2000 to 2002). The
objectives of Certimark are to design a benchmarking suite
which permits users to assess the appropriateness and to set
application scenarios for their needs, and to set up a standard
certification process for watermarking technologies [?]. The
benchmark system is a suite of modules, including image
source, watermark embedder/decoder, attack model,
comparator model, process-dependent metrics, report writer and result
& certificate module, with interfaces among the different
modules to guarantee consistency along the benchmarking
process. Although the reconfigurability level of Certimark is
higher than that of earlier systems, Certimark seems to have been
discontinued and no source code is publicly available.

Watermark Evaluation Testbed (WET) [?]-[?] is a
web-based system developed by researchers from Purdue
University to evaluate the performance of image watermarking
algorithms. WET consists of three major components: front
end, algorithm modules, and image database. To achieve the
goal of extensibility, the GNU Image Manipulation Program
(GIMP) is used because it supports plug-ins and extensions.
Some watermarking algorithms, StirMark 4.0 and some
evaluation metrics were implemented as GIMP plug-ins to be part
of WET’s algorithm modules. The end users can select some
images, one or more watermarking algorithms and attacks, and
specify the needed parameters via a web interface of the front
end. The evaluation results can be shown as ROC curves.
Similar to other systems, WET has limited reconfigurability.
In addition, its source code is not publicly available.

OpenWatermark [37], [38] is a web-based system for
benchmarking digital watermarking algorithms. It is composed of
three parts: 1) a web server and a remote method invocation
(RMI) client for users to submit their benchmarking requests
with specifications of the benchmarked algorithms, 2) a cluster
of RMI benchmark servers automating the benchmarking
process, and 3) an SQL database storing all data used in the
benchmarking process and the results produced by the benchmark
servers. OpenWatermark also contains some reference attacks,
evaluation metrics and test images as publicly available
resources. OpenWatermark is able to support two typical use
cases: the watermark extraction test and the watermark detection test.
OpenWatermark allows benchmarked algorithms to be
submitted as Windows/Linux executables or MATLAB/Python scripts,
and all its components were developed in Java, so it has some
reconfigurability. However, to support more features such as
benchmarking profiles and other media types, its source code
has to be modified. To some extent, OpenWatermark is more
like an online service for end users to define a sequence of
remote calls. The OpenWatermark implementation was available to
registered members at its website http://www.openwatermark.org/,
which is currently inaccessible.

Mesh Benchmark was proposed for 3D mesh
watermarking. It contains three different components: a data set,
a software tool and two evaluation protocols. The maximum
root mean square error (MRMS) and the mesh structural
distortion measure (MSDM) are used as perceptual distortion
metrics. The attacks currently included in Mesh Benchmark
are: file attack, geometry attacks (similarity transformation,
noise addition, smoothing, vertex coordinates quantization)
and connectivity attacks (simplification, subdivision,
cropping). As a benchmarking system focusing on 3D mesh
watermarking only, it considers only the payload, distortion and
robustness for performance evaluation. Besides, the evaluation
protocols are defined with fixed steps and thresholds, so the
reconfigurability of Mesh Benchmark is low.


III. Proposed OR-Benchmark Framework

In this section, our proposed OR-Benchmark framework
is introduced in detail. Firstly, we discuss the general
modelling of digital watermarking systems used in OR-Benchmark
in Section III-A. Then the evaluation criteria considered in
OR-Benchmark are discussed in Section III-B. After that,
the architecture of the OR-Benchmark framework and its
open interfaces for end users are explained in detail in
Sections III-C and III-D, respectively. This section ends with
a comparison between OR-Benchmark and all the benchmarking
systems reviewed, given in Section III-E.


A. Modelling of Watermarking Systems 

Following the community’s common understanding,
OR-Benchmark models a digital watermarking system as two
separate processes: the Sender, which embeds one or more
watermarks into a given cover work to generate a watermarked
work; and the Receiver, which extracts and/or detects one or more
watermarks that may have been embedded in a received test
work. Both the Sender and the Receiver take at least one input
(the cover or test work) but may take more optional inputs
(some are parameters), and produce one or more outputs.

The general models of the watermark embedding and
extraction/detection processes are shown in Fig. 2. As shown
in Fig. 2(a), the Sender always has the cover work as
the input and the watermarked work as the output. There are
three groups of optional inputs: the watermark(s)
to be embedded, the embedding key, and other optional
parameters controlling the embedding process. Note that the
watermark(s) in the embedding process can be either an input
(supplied by the user) or an output (if generated by the Sender
automatically), which can be further used for performance
evaluation purposes. As shown in Fig. 2(b), the Receiver takes
a minimum input (a test work) and possibly some other inputs
and parameters to produce one or more outputs, including one
or more extracted watermarks, one or more binary decisions
(if some given watermark(s) is/are detected), a restored work
(if the watermarking algorithm supports self-restoration), and
other outputs, e.g. the confidence level and error rates. We
model the inputs and outputs of the Sender and the Receiver
this way to cover different types of digital watermarking
algorithms and application scenarios. For example, “Binary
Decision(s)” as an output could be a single number (to show
whether a single given watermark is correctly detected or if the
content of the test work is authenticated), or a binary matrix
to show the authentication results of individual regions of the
test work.

Fig. 2: Modelling of the Sender and the Receiver in
OR-Benchmark. Dashed lines denote optional input/output.
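The Sender/Receiver model can be sketched in code as follows. This is only an illustration of the modelling idea and not the interface of the MATLAB prototype: all class, field and function names are hypothetical, and the toy least-significant-bit (LSB) scheme is used purely to exercise the mandatory and optional inputs and outputs.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass
class SenderOutput:
    watermarked_work: Any                       # mandatory output
    watermarks: Optional[Sequence[Any]] = None  # set if generated by the Sender


@dataclass
class ReceiverOutput:
    extracted_watermarks: Optional[Sequence[Any]] = None
    binary_decisions: Optional[Any] = None      # one flag or a per-region matrix
    restored_work: Optional[Any] = None         # for self-restoration schemes
    other: dict = field(default_factory=dict)   # e.g. confidence, error rates


def lsb_sender(cover, watermark=None, key=None, **params):
    """Toy Sender: writes one watermark bit into the LSB of each sample."""
    generated = watermark is None
    if generated:
        watermark = [i % 2 for i in range(len(cover))]  # alternating 0/1 bits
    if len(watermark) != len(cover):
        raise ValueError("watermark and cover must have the same length")
    marked = [(sample & ~1) | bit for sample, bit in zip(cover, watermark)]
    return SenderOutput(marked, [watermark] if generated else None)


def lsb_receiver(test_work, watermark=None, key=None, **params):
    """Toy Receiver: extracts the LSBs and optionally checks a given watermark."""
    extracted = [sample & 1 for sample in test_work]
    decision = None if watermark is None else extracted == list(watermark)
    return ReceiverOutput(extracted_watermarks=[extracted],
                          binary_decisions=decision)
```

Optional outputs default to `None`, so a benchmarking loop can inspect which outputs a given algorithm actually produces and select the applicable performance indicators accordingly.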


Another important part of performance evaluation of
digital watermarking algorithms is the communication channel
between the Sender and the Receiver, which can be used to
model any intermediate processing on a watermarked work,
such as channel noise or any other unwanted distortions,
benign processing such as re-compression in some application
scenarios, and attacks whose goal is to make the watermark
extraction/detection process fail. In OR-Benchmark the
communication channel is simulated as a black box called “channel
simulator”, taking a single input (a work) and producing a
single output (a processed work), which can be used to
cover everything that may happen between the Sender and the
Receiver. We will discuss this further in Sec. III-C.


B. Performance Evaluation Criteria 

In OR-Benchmark, performance evaluation criteria (i.e., indicators)
are organized into two categories: 1) built-in performance
indicators that can be selected by users directly; 2) user-defined
performance indicators that are supported indirectly
by generating a comprehensive set of raw results for users
to further process. In this section, the commonly required
performance indicators for benchmarking digital watermarking
algorithms are further discussed.

As in StirMark, the properties that designers and users
of digital watermarking algorithms wish to evaluate include
imperceptibility, embedding capacity, robustness to benign
processing and attacks, false alarm rates and the speed of
execution. Since these common criteria have been discussed
in Sec. II-A, here we focus on only two other evaluation
properties for content authentication and self-restoration
watermarking algorithms that are not (well) supported by StirMark
but essential for some application scenarios.

Authentication Accuracy: For content authentication watermarking,
there are two basic metrics to measure the authentication
accuracy of the detection process: the false-positive
(FP) rate indicating the level of errors for areas reported as
“tampered”, and the false-negative (FN) rate indicating the
level of errors for areas reported as “untampered”. Many
other performance metrics can be derived from the FP and
FN rates, e.g. the average authentication rate used in [40]
and the ROC curves widely used in the digital watermarking
community and the machine learning community more
broadly. OR-Benchmark supports the two main metrics and
also provides the needed raw data in the benchmarking results to
allow more complicated user-defined metrics that cannot be
derived directly from the FP and FN rates.
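
To make the two metrics concrete, the following Python sketch (an illustration only; the OR-Benchmark prototype itself is written in MATLAB) computes FP and FN rates from block-level tamper maps. It assumes that the FP rate is the fraction of untampered blocks reported as “tampered” and the FN rate is the fraction of tampered blocks reported as “untampered”; the function name and the list-based map format are hypothetical.

```python
def authentication_error_rates(truth, reported):
    """Return (fp_rate, fn_rate) for two block-level tamper maps.

    truth, reported: 2-D lists of 0/1 values, where 1 means "tampered".
    """
    fp = fn = tampered = untampered = 0
    for truth_row, reported_row in zip(truth, reported):
        for t, r in zip(truth_row, reported_row):
            if t:
                tampered += 1
                fn += (r == 0)  # tampered block reported as untampered
            else:
                untampered += 1
                fp += (r == 1)  # untampered block reported as tampered
    fp_rate = fp / untampered if untampered else 0.0
    fn_rate = fn / tampered if tampered else 0.0
    return fp_rate, fn_rate


# Toy 2x4 block map: the right half of the work was tampered with.
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
reported = [[0, 1, 1, 1],
            [0, 0, 0, 1]]
fp_rate, fn_rate = authentication_error_rates(truth, reported)
```

Keeping the two maps (rather than only the two rates) as raw results is what would allow user-defined metrics such as ROC curves to be computed later.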

Perceptual Quality of Recovered Work: For self-restoration
watermarking algorithms (which require the use of content
authentication watermarks), a key performance indicator is
the perceptual quality of the recovered work. The perceptual
quality can be measured in the same way as the perceptual
quality of a watermarked work is measured. In OR-Benchmark,
some commonly used image quality assessment
(IQA) metrics such as PSNR and SSIM are incorporated, but
users can add their own metrics easily via the open interface
discussed in Sec. III-D. For self-restoration watermarking
algorithms, there is a question of whether the perceptual quality
should be calculated for the whole work or just for the detected
regions labelled as “tampered”. If the latter option is used,
the tampered regions falsely reported as “untampered” will
be missed, so the result will be misleading. Therefore,
OR-Benchmark measures the quality using the whole work.
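
The effect of this choice can be illustrated with a small Python sketch. It is not taken from the prototype: the `psnr` helper, the flat pixel lists and the toy values are assumptions made for illustration. Four pixels are tampered, but the Receiver detects (and restores) only two of them, so a PSNR restricted to the detected region looks much better than the PSNR of the whole work.

```python
import math


def psnr(reference, test, mask=None, peak=255.0):
    """PSNR in dB over the pixels selected by mask (all pixels if mask is None)."""
    pairs = [(r, t) for i, (r, t) in enumerate(zip(reference, test))
             if mask is None or mask[i]]
    if not pairs:
        raise ValueError("mask selects no pixels")
    mse = sum((r - t) ** 2 for r, t in pairs) / len(pairs)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


original = [100] * 8
# Pixels 4..7 were tampered; pixels 4 and 5 were detected and restored
# (values 98 and 102), pixels 6 and 7 were missed and keep tampered values.
recovered = [100, 100, 100, 100, 98, 102, 20, 20]
detected = [0, 0, 0, 0, 1, 1, 0, 0]

whole_work_psnr = psnr(original, recovered)
detected_only_psnr = psnr(original, recovered, mask=detected)
```

Here `whole_work_psnr` is far lower than `detected_only_psnr`, because only the whole-work measurement accounts for the two missed pixels.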


C. Our Benchmarking Framework 

In this subsection, we introduce the overall architecture of
OR-Benchmark in detail. Figure 3 gives a schematic overview
of the framework, which can be split into two parts: an Online
Benchmarker, which takes input from the user and automates the
benchmarking process to generate results for further analysis,
and an Offline Analyzer, which allows the user to conduct
user-specific tasks (e.g. statistics and visualization) based on the
(raw) results produced by the Online Benchmarker. The Offline
Analyzer can be equipped with one or more Report Engines to
produce more user-friendly reports of benchmarking tasks. The
Report Engines may also access the results from the Online
Benchmarker without passing the Offline Analyzer (in that case
the Offline Analyzer can be seen as a simple data forwarder).

The Online Benchmarker contains three groups of components:
1) the user-provided components, i.e., the Sender and the
Receiver provided by the user as the subject of benchmarking;
2) a Multimedia Database holding the test media, and an Attacks
Library and a PE Library providing attacks and performance
evaluation (PE) algorithms, respectively; and 3) the core
benchmarker, composed of a central Controller, a Channel
Simulator enabling incorporation of different types of attacks
and processing on a watermarked work, and a Performance
Evaluator which produces results to store in a Results Database
as the output of the whole benchmarking process. The central
Controller interacts with the user to define the benchmarking
profile, and with other components of the online benchmarker
to automatically execute the profile. A benchmarking profile
allows automatic testing of multiple parameters of the same
digital watermarking algorithm, multiple attacks, multiple PE
algorithms and multiple performance indicators. The Controller
can also automatically determine default settings based
on information given by the user to reduce the user's burden
in defining the benchmarking profile.

Fig. 3: The architecture of the OR-Benchmark framework.


The whole benchmarking process works as follows from
an end user’s point of view. The user first interacts with the
Controller to define a benchmarking profile, for which (s)he
provides his/her own Sender and Receiver functions for benchmarking
and defines what to benchmark. The user may also define the
watermark(s) to be embedded if user-specific watermarks are
required. It is possible to define how the watermarks are
formatted so that the Controller can generate them automatically.
The user also needs to select test media from the Multimedia
Database, possibly by extending the database with his/her own test
works. Based on the benchmarking profile, the Controller
feeds selected test multimedia works and any meta-data to the
Sender, selected attacks to the Channel Simulator, attacked
works to the Receiver, and then selected PE algorithms to the
output of the Receiver to produce data stored in the Results
Database.
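
The data flow described above can be sketched in a few lines of Python. This is only an illustration of the control loop under simplifying assumptions (works are lists of numbers, every component is a plain callable, and the function and field names are hypothetical); it is not the MATLAB Controller of the prototype.

```python
def run_benchmark(works, sender, attacks, receiver, pe_algorithms):
    """Feed each work through Sender -> Channel Simulator -> Receiver -> PE."""
    results = []  # stands in for the Results Database
    for work_id, work in works.items():
        watermarked = sender(work)
        for attack_name, attack in attacks.items():
            attacked = attack(watermarked)
            output = receiver(attacked)
            record = {"work": work_id, "attack": attack_name}
            for pe_name, pe in pe_algorithms.items():
                record[pe_name] = pe(work, watermarked, output)
            results.append(record)
    return results


# Toy components: "embedding" adds 1 to every sample, "restoration" subtracts it.
works = {"w1": [10, 20, 30]}
attacks = {
    "none": lambda w: list(w),
    "brighten": lambda w: [v + 5 for v in w],
}
pe_algorithms = {
    "max_abs_error": lambda original, watermarked, restored:
        max(abs(a - b) for a, b in zip(original, restored)),
}
results = run_benchmark(
    works,
    sender=lambda w: [v + 1 for v in w],
    attacks=attacks,
    receiver=lambda w: [v - 1 for v in w],
    pe_algorithms=pe_algorithms,
)
```

Each record keeps the identifiers of the work and the attack next to the indicator values, so that an offline analysis step can regroup the raw results freely.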


D. Open Interfaces 

OR-Benchmark is designed to have open interfaces so that
users can easily (re)configure and extend the framework and
define different benchmarking tasks. Observing Fig. 3,
there are mainly the following interfaces.

The interfaces between the Sender/Receiver and the
core benchmarker allow user-defined digital watermarking
algorithms to be benchmarked. Following the general models
of the Sender and the Receiver discussed in Sec. III-A,
the interfaces are materialized as the input and output
interfaces of two functional units:

Sender: (Original Cover Work, [Watermark(s)], [Key], [...]) → (Watermarked Work, [Watermark(s)])

Receiver: (Test Work, [Watermark(s)], [Key], [...]) → ([Watermark(s)], [Decision], [Restored Work], [...])

where arguments in square brackets are optional and “...” denotes
more optional arguments. A proper mechanism is required to
inform the Controller about the valid values each input argument
can take and other meta information (e.g. the display name
of each argument), in order to create benchmarking profiles
for enumerating all values of any input argument of interest.
The implementation of the interfaces differs depending on
the programming language used, e.g. for object-oriented
programming (OOP) languages they can be implemented
as a class with methods representing the two functional
units and member variables representing inputs, outputs and
meta information, and for non-OOP languages two separate
functions with optional parameters can be defined to achieve the
same goal.
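
As a sketch of the OOP variant, the Python class below wraps a deliberately trivial scheme (one bit hidden in the parity of the first sample) behind the two functional units. All names (`ParityScheme`, `ReceiverOutput`, the `parameters` dictionary and its fields) are hypothetical and chosen for illustration; the prototype defines its own MATLAB conventions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ReceiverOutput:
    """Optional Receiver outputs; anything not produced stays None."""
    watermarks: Optional[list] = None
    decision: Optional[Any] = None
    restored_work: Optional[Any] = None
    extras: dict = field(default_factory=dict)


class ParityScheme:
    """Toy scheme: hides one bit in the parity of the first sample."""

    # Meta information a Controller could use to enumerate candidate values.
    parameters = {"key": {"display_name": "Embedding key", "values": [0, 1]}}

    def sender(self, cover, watermarks=None, key=0):
        # The watermark is an input if supplied, otherwise generated here.
        bit = watermarks[0] if watermarks else 1
        marked = list(cover)
        marked[0] = (marked[0] & ~1) | (bit ^ key)
        return marked, [bit]

    def receiver(self, test_work, watermarks=None, key=0):
        bit = (test_work[0] & 1) ^ key
        # A binary decision is only possible when a reference is given.
        decision = None if watermarks is None else (bit == watermarks[0])
        return ReceiverOutput(watermarks=[bit], decision=decision)


scheme = ParityScheme()
marked, embedded = scheme.sender([12, 7], watermarks=[1], key=1)
output = scheme.receiver(marked, watermarks=embedded, key=1)
```

The optional outputs mirror the square brackets in the interface definition: this toy Receiver returns an extracted watermark and a decision but no restored work.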

The interface between the Multimedia Database and the
core benchmarker allows users to reconfigure and extend the
Multimedia Database. This can be achieved by an agreed
structure of the Multimedia Database, such as a hierarchical
structure of folders and files, or by using a human-readable
configuration file (such as XML) to allow the system and end
users to find test multimedia works. Note that OR-Benchmark
can support any media types, so the Multimedia Database can
be a mixture of different types of media files including audio
tracks, images, video sequences, 3D models and others.

The interface between the Attacks Library and the core
benchmarker allows users to reconfigure and extend the Attacks
Library used by the Channel Simulator. As discussed in
Sec. III-A, an attack in the Attacks Library is a simple functional
unit as follows: Attack: (Input Work, [...])
→ (Output Work). Again, a mechanism is needed to
convey meta information about any optional input arguments.
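
A minimal Python sketch of this interface is given below, assuming a work is a flat list of 8-bit samples. The two attack functions and the `channel_simulator` helper are hypothetical stand-ins written for illustration; they show how attacks with optional parameters can be chained inside a black box with one input and one output.

```python
import random


def additive_gaussian_noise(work, mean=0.0, variance=1.0, seed=0):
    """Add Gaussian noise and clip to the 8-bit range [0, 255]."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [min(255, max(0, round(v + rng.gauss(mean, sigma)))) for v in work]


def copy_and_paste(work, src=0, dst=4, length=2):
    """Copy `length` samples starting at `src` over the samples at `dst`."""
    out = list(work)
    out[dst:dst + length] = work[src:src + length]
    return out


def channel_simulator(work, attacks):
    """Apply (attack, kwargs) pairs in order: one work in, one work out."""
    for attack, kwargs in attacks:
        work = attack(work, **kwargs)
    return work


work = [10, 20, 30, 40, 50, 60, 70, 80]
pasted = channel_simulator(work, [(copy_and_paste, {"src": 0, "dst": 4, "length": 2})])
noisy = channel_simulator(work, [
    (copy_and_paste, {"src": 0, "dst": 4, "length": 2}),
    (additive_gaussian_noise, {"variance": 4.0, "seed": 1}),
])
```

Passing the optional arguments as a dictionary next to each attack is one simple way for a controller to enumerate candidate parameter values without knowing the attack's internals.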

The interface between the PE Library and the core benchmarker
allows users to reconfigure and extend the PE Library
used by the Performance Evaluator. There are different types
of PE algorithms depending on the performance indicators
used, so there are different input and output interfaces. An
important class of PE algorithms are perceptual quality
assessment (PQA) metrics defined as follows: PQA: (Work1,
Work2, [...]) → (Metric), where the output metric is
a numeric rating of the perceptual quality. Again, a mechanism
is needed to convey meta information about optional input
arguments. PQA algorithms are generally objective ones based
on automated computer programs, but it is possible to define a
virtual functional unit where human experts (e.g. from
crowdsourcing websites) are involved to rate the quality subjectively.

System search paths can be set up for all the above interfaces
so that the Controller and other components of the core
benchmarker can automatically discover candidate algorithms
and test multimedia works. Each path can be a combination
of local file paths and URLs including web addresses. When
web addresses are involved, a local caching mechanism may
be created to allow fast retrieval of contents from remote
resources.
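
The discovery step can be sketched as follows in Python. The sketch covers local folders only (URLs and caching are left out), uses file names as the identifiers of functional units, and lets earlier folders take precedence when names clash; these choices and the `discover` function itself are assumptions made for illustration, not the behaviour of the prototype.

```python
import tempfile
from pathlib import Path


def discover(search_paths, pattern="*.py"):
    """Map unit names to files found along the search path."""
    found = {}
    for folder in search_paths:
        for path in sorted(Path(folder).glob(pattern)):
            found.setdefault(path.stem, path)  # earlier folders win
    return found


# Two temporary folders stand in for a built-in and a user Attacks Library.
with tempfile.TemporaryDirectory() as builtin, tempfile.TemporaryDirectory() as user:
    (Path(builtin) / "jpeg.py").write_text("")
    (Path(user) / "jpeg.py").write_text("")
    (Path(user) / "noise.py").write_text("")
    attacks = discover([builtin, user])
    attack_names = sorted(attacks)
    jpeg_from_builtin = attacks["jpeg"].parent == Path(builtin)
```

With this approach a new functional unit becomes available simply by dropping a file into one of the folders, without editing any configuration file.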

The interface between the core benchmarker and the Results
Database allows users to reconfigure and extend the format of
the results used by the Offline Analyzer and Report Engines.
This can be achieved by a configuration file (e.g. an XML file
following a pre-defined schema) indicating the format of the
results of a particular benchmarking profile.

The interface between the user and the Controller allows
creation of benchmarking profiles. Core elements of a
benchmarking profile include the digital watermarking algorithm(s)
tested and candidate values of input parameters, test multimedia
works, selected attacks, selected PE algorithms, and the
format of the results. This can be implemented as a graphical
user interface (GUI) and/or a human-readable configuration file.
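
To illustrate how such a profile can drive automatic enumeration, the Python sketch below expands a small profile into individual runs. The dictionary layout, the scheme and parameter names, and the `enumerate_runs` helper are hypothetical; a real profile would also list test works, PE algorithms and the result format.

```python
from itertools import product


def expand(grid):
    """Yield every combination of candidate values in a {name: [values]} grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def enumerate_runs(profile):
    """Yield one run description per (algorithm, parameters, attack, settings)."""
    for algorithm, parameter_grid in profile["algorithms"].items():
        for params in expand(parameter_grid):
            for attack, attack_grid in profile["attacks"].items():
                for attack_params in expand(attack_grid):
                    yield {"algorithm": algorithm, "params": params,
                           "attack": attack, "attack_params": attack_params}


profile = {
    "algorithms": {
        "scheme_a": {"strength": [1, 2]},
        "scheme_b": {"strength": [1, 2]},
    },
    "attacks": {
        "jpeg": {"quality": [90, 70]},
        "none": {},  # an attack without parameters yields exactly one run
    },
}
runs = list(enumerate_runs(profile))
```

The example profile expands to 2 schemes x 2 strengths x (2 JPEG settings + 1 unattacked case) = 12 runs per test work.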

The interface between the user and the Offline Analyzer
and Report Engines allows users to investigate the raw results
recorded in the Results Database in an interactive way and
to produce more user-friendly reports. The interface for the
Offline Analyzer can be implemented as a GUI, but the Report
Engines could be just command-line tools invoked from the
Offline Analyzer’s GUI. The format of the produced reports
can be defined using a human-readable configuration file.

E. Implementation 

We implemented a prototype of OR-Benchmark as a MATLAB
software package which includes all key components
shown in Fig. 3 and the interfaces listed in Sec. III-D.
The prototype is built on MATLAB standard functions and
toolboxes, and does not depend on any third-party libraries.
MATLAB was selected considering its wide use in the digital
watermarking community and the ease of dynamically extending
the implemented system without compiling the whole source
code. The cross-platform nature of MATLAB also makes the
OR-Benchmark prototype more accessible to researchers using
different operating systems. Although the prototype is fully
functional (see the case study in Sec. IV), we are still refining it
and plan to release a beta edition under an open source license
once this paper is accepted for publication.

The MATLAB prototype by default uses a number of pre-defined
folders to store files and data in the Multimedia
Database, a library of different digital watermarking algorithms
(each including a Sender function and a Receiver function), the
Attacks Library, the PE Library, and the Results Database. The
user can freely add new functions following the interfaces
discussed in Sec. III-D to the corresponding folders to extend
the system. The prototype can also be configured to use a
search path including multiple folders for each of the above
listed components. There is another folder keeping MATLAB
scripts implementing the Controller, the Channel Simulator,
the Performance Evaluator, and the Offline Analyzer. The
Controller has both a GUI for creating benchmarking profiles
(see Fig. 4) and a benchmarking scheduler for automatically
executing benchmarking profiles. Given the flexible interfaces
of and the meta information about digital watermarking
algorithms, the Controller allows the user to define test multimedia
works, candidate values of input arguments of the Sender
and Receiver functions, and selected attacks and PE algorithms,
in order to schedule and launch a number of repeated runs
of the digital watermarking process to produce all raw data
and performance indicators recorded in the Results Database.
Each benchmarking profile created by the Controller is stored
as a MATLAB structure variable in the workspace, and once
the benchmarking task is completed the benchmarking profile
and the benchmarking results are saved as another MATLAB
variable in a MATLAB data file in a designated folder of the
Results Database. Here, the benchmarking profile is kept to
inform the Offline Analyzer about the format of the results.

The Offline Analyzer has a GUI for producing different
kinds of 2-D plots based on raw data in the Results Database
(see Fig. 5). At the current stage of development, the Offline
Analyzer is designed to showcase what one can do with the
OR-Benchmark prototype (see the case study in Sec. IV), so it is
not a complete solution for all digital watermarking schemes
yet. We plan to design a plug-in interface to allow different
analysis and plotting functions to be incorporated into the
Offline Analyzer. It is worth noting that the user can develop
his/her own Offline Analyzer easily since the Results Database
contains all needed data in a structured and directly accessible
way.

F. Comparison with Other Benchmarking Systems 

Compared with other digital watermarking benchmarking
systems and frameworks, OR-Benchmark has the most generic
modelling of digital watermarking systems, which allows it to
support all types of digital watermarking algorithms at the
level of system modelling. While most other benchmarking
systems can be extended to cover more types of digital
watermarking algorithms, often significant changes to the
source code of their implementations are required. Some other
benchmarking systems model digital watermarking algorithms
in a way such that it is hard to cover multiple watermarks
(especially of different types) in the same cover work. This
advantage of OR-Benchmark can be seen from the case study
we will discuss in Sec. IV, which is about benchmarking three
image authentication and self-restoration digital watermarking
algorithms involving two different types of watermarks for a
single cover (one type for image authentication and the other
for self-restoration). Such digital watermarking algorithms are
among the most complicated ones and, to the best of our knowledge,
no other benchmarking system/framework can properly
cover them in its current system model and implementation.
This was actually one of the main reasons why we
decided to develop our own framework.

Fig. 4: The main GUI of the central Controller for defining
benchmarking profiles as currently implemented in OR-Benchmark.

Fig. 5: The main GUI of the Offline Analyzer as currently
implemented in OR-Benchmark.

Unlike many other digital watermarking benchmarking
systems, OR-Benchmark offers openness
and reconfigurability by design. The framework separates
users, data, algorithms, the online benchmarker and the offline 
analyzer to achieve a more user-friendly data flow in the 
whole benchmarking process. Comparing Fig. 3 with Fig. 1, we
can see that OR-Benchmark has a clearer separation of different
components and a clearer data-flow path from the Sender to the 
Performance Evaluator. OR-Benchmark also has more clearly
defined interfaces to support different user-specific operations,
including creating benchmarking profiles and (re)configuring and
extending the benchmarking system. Most other benchmarking
systems also allow limited (re)configuration, often via definition
of a user-specific evaluation profile (e.g. StirMark using
an INI file), but adding new functional units will normally
require updating the source code of their implementation (e.g.
for StirMark, a C++-based system, changes to key header files
cannot be avoided). In comparison, OR-Benchmark has
open interfaces to allow reconfiguration and extension, and
our MATLAB implementation allows new functional units to
be added without touching any other parts of the core system
(not even any configuration file, since available functional units
can be automatically discovered in the search paths of different
components following the defined open interfaces).

Another unique feature of OR-Benchmark is its support of 
all media types with a single model and process. In OR- 
Benchmark, digital watermarking of any media type can be 
handled in exactly the same way and the user does exactly the 
same steps to benchmark digital watermarking algorithm(s). 
Many functional units can be shared across different media 
types e.g. many PE algorithms can be applied to multiple 
media types. On the other hand, most other benchmarking 
systems focus on one or two particular media types (mostly 
digital images and some extended to cover audio) and the 
implementations are very much tuned to support the one or 
two media types. This is another reason why the extensibility of
other benchmarking systems is lower than that of OR-Benchmark.

Our choice of MATLAB to implement the OR-Benchmark
prototype also contributes to the reconfigurability
of OR-Benchmark. Most other benchmarking systems were
developed based on compiled programming languages,
especially C/C++, which makes incorporation of source code
written in other programming languages harder or impossible.
MATLAB has built-in support for functional units written
in most mainstream programming languages such as C/C++,
Java, and Python, thus making the extension much easier.

IV. Case Study 

In this section, we demonstrate how our implemented OR-Benchmark
prototype can be used via a case study on benchmarking
three image watermarking algorithms for content
authentication and self-restoration. Such algorithms are among
the most complicated ones, with two types of watermarks per
block of the cover work, and are not supported by other
benchmarking systems. While this section is mainly a case study
showcasing the usefulness of OR-Benchmark, the watermarking
algorithms benchmarked have never been compared in such
depth as we report here (such a comparison was harder before
due to the lack of proper benchmarking tools).


A. Experimental Setup and User Operations 

The three image authentication and self-restoration digital
watermarking algorithms benchmarked are the following: Lin
and Chang’s scheme [41] (M1), Li et al.’s scheme [42] (M2)
and Wang et al.’s scheme [43] (M3), which all use two
watermarks separately for image authentication and restoration
of each 8×8 block. For each scheme, a Sender function and a
Receiver function were written in MATLAB following the interfaces for
the two components and then copied to the folder holding
all such functions. Those functions were then selected as the
target of the benchmarking task via the Controller’s GUI. The
GUI allows use of multiple candidate values of each parameter
of each scheme, but for this case study we tuned the three
schemes’ parameters so that the average perceptual quality
of the watermarked images is roughly aligned to make the
comparison fairer (see below for more details).

For attacks, we chose a simple “copy and paste attack”,
JPEG compression, and additive and multiplicative Gaussian white
noise as four separate attacking algorithms, each of which
is injected into the Channel Simulator to create attacked
watermarked images sent to the Receiver. All the attacks were
implemented as separate MATLAB functions with additional
input parameters. Those functions were added to the folder
holding the Attacks Library and then selected (with different
values of input parameters) via the Controller’s GUI. For the
“copy and paste attack”, a randomly selected region covering 10%
of the whole image was copied and pasted to other regions of the
same image. For JPEG compression, the QF (quality factor)
is the only input parameter, with values 100, 95, 90, ..., 50.
For additive Gaussian white noise, the mean (with the only
value 0) and the variance (with the values 1, 3, 5, ..., 39 using
255 as the peak pixel value) are used as input parameters. For
multiplicative Gaussian white noise, the same input parameters
(the mean and the variance) are used, but the variance’s values are
different (1, 10, 20, ..., 240). The “copy and paste attack” was
used as an always-on attack and optionally combined with one
of the other attacks for benchmarking robustness against attacks.
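
The size of the resulting parameter grid can be checked with a short Python sketch. It assumes that the elided multiplicative-noise values continue in steps of 10 after the first value, and that every test image is passed once through every channel setting for every scheme; the variable names are hypothetical.

```python
# Candidate values of the attack parameters listed above.
jpeg_quality_factors = list(range(100, 49, -5))            # 100, 95, ..., 50
additive_variances = list(range(1, 40, 2))                 # 1, 3, ..., 39
multiplicative_variances = [1] + list(range(10, 241, 10))  # 1, 10, 20, ..., 240 (assumed step)

# The copy-and-paste attack alone (baseline), plus the copy-and-paste attack
# combined with exactly one setting of one additional attack.
channel_settings = (1 + len(jpeg_quality_factors)
                    + len(additive_variances)
                    + len(multiplicative_variances))

# 100 test images and 3 schemes, one Receiver run per channel setting.
receiver_runs = channel_settings * 100 * 3
```

Under these assumptions there are 57 channel settings and 17,100 Receiver runs in total, which illustrates why automatic scheduling by the Controller is useful.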

For performance indicators, we considered imperceptibility
(i.e., perceptual quality of watermarked images), authentication
accuracy (in terms of FP and FN rates), perceptual
quality of recovered images (with and without attacks), and
processing times of the Sender and the Receiver functions.
For perceptual quality we chose PSNR and SSIM, which are
the two most widely used IQA metrics. Each performance
indicator is represented by one MATLAB function which was
added to the folder holding the PE Library. The selection of
the PE algorithms was also done via the Controller’s GUI.

For the test images, we collected 100 8-bit gray-scale
images of size 256 x 256, 384 x 256 and 512 x 512, which were
added to a sub-folder of the folder holding the Multimedia
Database. The images cover a broad range of image types, e.g.
images of outdoor or indoor scenes, portraits, photos of natural or
man-made objects, and texture images. The test images were
selected by setting the test multimedia works to be all files
from the corresponding sub-folder via the Controller’s GUI.

All the above choices allowed the Controller to create
a benchmarking profile. For the format of the results, we
relied on the Controller to automatically create a default
format to capture all raw data and performance indicators
using a MATLAB variable including the benchmarking profile
itself. After the benchmarking profile was set up, we instructed
the Controller to automatically run the benchmarking task to
generate the results. The machine running the benchmarking
task was a PC with an Intel Core 2 Duo CPU (3.16GHz) and
2GB RAM. The concurrency support of the dual-core CPU
was not enabled in order to get a better estimate of the processing
times. The MATLAB version used was MATLAB R2012a.

After the results were produced by the core benchmarker, 
the Offline Analyzer was used to generate some 2-D plots 
for a better understanding of the performance of the three 
benchmarked image watermarking schemes. Considering all 
the results we observed, it is clear that M3 has the best 
performance, followed by M1 and then M2. In the following,
we show some selected benchmarking results we obtained. 

B. Imperceptibility 

Figure 6 shows the PSNR and SSIM values of all the
100 test images after going through each of the three digital
watermark embedding processes. Although we tried to align
the perceptual quality to make the comparison fairer, there are
noticeable fluctuations across different test images due to the
complexity of visual quality assessment. We managed to make
the average PSNR values of the three digital watermarking
schemes all between 36.5 and 36.7 dB (36.56 dB, 36.57 dB,
36.68 dB, respectively). One interesting observation is that
M3’s PSNR values fluctuate more than those of M1 and M2,
while their SSIM values have a similar level of fluctuation.
Since SSIM is an IQA metric matching subjective quality 
better p^ , we thus consider the three digital watermarking 
schemes are aligned well. Note that the alignment process 
of the imperceptibility actually involved running the three 
digital watermarking schemes through all the test images using 
different parameters and then calculating the average IQA 
value for each parameter setting of each scheme. This process 
itself is actually a set of simple benchmarking profiles with 
only two performance indicators (average PSNR and SSIM 
values of all watermarked images). 

C. Authentication Accuracy 

To evaluate the authentication accuracy of an image authentication
watermarking scheme, attacks manipulating contents of
watermarked images should be considered. To this end, we
applied the 10% “copy and paste attack” to each watermarked
image and calculated the FP and FN rates by counting the 8×8
blocks wrongly reported by the Receiver. Other attacks are not
considered here so that we focus on the baseline FP/FN rates.
For the FP rate, M1 and M3 have an almost zero rate for all
images, and M2 has an average FP rate of 1.36%. For the FN
rate, M1 has an almost zero rate, M3 has an average FN rate
of 1.59%, and M2 has the worst rate of 3.02%. See Fig. 7 for
the FP and FN rates of all the 100 test images for the three
digital watermarking schemes.

Fig. 6: The quality comparison of watermarked images produced
by the three different watermarking schemes. The x-axis
is the image index and the y-axis is the PSNR/SSIM value.


D. Recovered Image Quality 

Similar to the case of authentication accuracy, for perceptual
quality of recovered images we also focused on the condition
where the 10% “copy and paste attack” is applied without
other attacks. The mean PSNR values of 100 recovered images
for M1, M2 and M3 are 27.80, 27.92 and 32.32 dB, respectively.
The mean SSIM values of 100 recovered images for M1,
M2 and M3 are 0.9249, 0.9270 and 0.9506, respectively. The
results showed that M3 is the best scheme with a significantly
better capability of recovering manipulated images.


E. Processing Time 

Except for the embedding process of M1, which took around
2.6 seconds on average, all other processes of the three digital
watermarking schemes consumed less than 1 second. Considering
that MATLAB is generally much less efficient than compiled
programming languages, the results suggest that all three
schemes are practical for real-world applications.


F. Robustness

Fig. 7: The FP and FN rates of the three different watermarking
systems. The x-axis is the image index and the y-axis is
the FP/FN rate.

For benchmarking robustness, we combined the 10% “copy
and paste attack” with one additional attack (JPEG compression,
additive or multiplicative Gaussian white noise) to
gauge the robustness of each digital watermarking scheme
against each additional attack. Note that each additional attack
does not change the contents of watermarked images but tries
to make the authentication process fail. For each combination, the
same performance indicators on authentication accuracy (FP
and FN rates) and quality of recovered images were calculated
against the parameter of each additional attack (QF for JPEG
compression, variance for additive and multiplicative noise).
Since we now have more factors to look at, we average
the performance indicators across all 100 images to get the
average values, which are then shown against the parameter
value of each additional attack to see how the strength of the
attack influences the performance of each digital watermarking
scheme. The results are shown in Figs. 8 and 9, and more
discussions are given below.
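
The averaging step can be sketched in Python as follows, assuming the raw results are stored as one record per image and attack setting. The record layout, the field names and the toy values are hypothetical and are not results from the case study.

```python
from collections import defaultdict
from statistics import mean


def average_by_parameter(records, parameter, indicator):
    """Average an indicator over all images for each value of an attack parameter."""
    groups = defaultdict(list)
    for record in records:
        groups[record[parameter]].append(record[indicator])
    return {value: mean(samples) for value, samples in sorted(groups.items())}


# Toy raw results for two images and two JPEG quality factors.
records = [
    {"image": 1, "qf": 90, "fp_rate": 0.0},
    {"image": 2, "qf": 90, "fp_rate": 0.25},
    {"image": 1, "qf": 50, "fp_rate": 1.0},
    {"image": 2, "qf": 50, "fp_rate": 0.5},
]
average_fp = average_by_parameter(records, "qf", "fp_rate")
```

The resulting mapping from parameter value to average indicator is the kind of data that a curve in a robustness plot is drawn from.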

1) JPEG Compression: Figures 8(a) and 8(d) show the average
FP and FN rates after JPEG compression is applied to the
three digital watermarking schemes. From the results, we can
observe that M1 and M3 are very robust to JPEG compression
with low FP and FN rates (the FP rate < 10% and the FN rate
≈ 0%) when QF > 55. However, when a JPEG compression
process with a QF value of 50 is applied, the authentication
watermarks in the watermarked images are nearly completely
destroyed with an FP rate close to 100%. We can also observe
that, compared with M1 and M3, M2 is less robust against
JPEG compression, especially when QF < 85.

Figures 9(a) and 9(d) show the average PSNR and SSIM values
corresponding to the three digital watermarking schemes. The
results show that M1 and M3 can provide reasonably good
image quality if QF > 50, but M2’s performance drops rapidly
when QF < 85. The general trend matches the results on
the authentication accuracy since any false detections will
influence the quality of the recovered image. Between M1 and
M3, we can also observe that M3 performs slightly better in
terms of SSIM but significantly better in terms of PSNR.

Fig. 8: Average FP and FN rates of M1, M2 and M3 w.r.t. different parameter values of attacks.

Fig. 9: Average perceptual quality of images recovered by M1, M2 and M3 w.r.t. different parameter values of attacks. Panels include (c) PSNR against multiplicative noise and (f) SSIM against multiplicative noise, both plotted against the noise variance.
2) Additive & Multiplicative Gaussian White Noises: Figures
8(b), (e), (c) and (f) show the average FP and FN rates when
noise is added for the three digital watermarking schemes.
The results on the FP rates show that M3 outperforms M1
significantly and M2 is the worst among the three. The average
FN rates of all the three schemes remain close to 0% so there
is no noticeable difference among them.

Figures 9(b), (e), (c) and (f) show the average PSNR and
SSIM values of the 100 images recovered by the three digital
watermarking schemes after different levels of noise are added.
As expected, the average quality of recovered images largely
decreases smoothly as the variance (energy) of noise increases.
Among the three schemes, M2 is again the worst performing
scheme and M3 outperforms M1 significantly in terms of both
PSNR and SSIM (for the latter after the variance of the noise
goes beyond a threshold).

V. More Discussion 

The previous section gives evidence about the usefulness of 
OR-Benchmark. While it is clear that OR-Benchmark has the 
potential to be a useful framework for the digital watermarking 
community, we would like to highlight that benchmarking 
complicated systems like digital watermarking schemes is not 
a simple matter and a more careful design of the benchmarking 
task is needed. In other words, the benchmarking task has to be 
designed on an ad hoc basis by the user of the benchmarking 
system, which is supported by the high reconfigurability 
and extensibility of the OR-Benchmark framework. While 
benchmarking tasks have to be designed individually, there are 
known common issues that we need to pay special attention to. 
For instance, it has been well known that using different PQA 
metrics may lead to different results when comparing different 
digital watermarking schemes. This has been demonstrated
partly by the results shown in Sec. IV, where PSNR and
SSIM do not always give the same results (e.g. for M3).

To further highlight the subtlety of performance evaluation
of digital watermarking schemes, in this section we show a
concrete example related to the “visual quality of recovered
image vs. JPEG compression attack” issue, which is about
the use of QF as the control factor of JPEG compression
to compare performance of digital watermarking schemes as
shown in Sec. IV-E. While it is common practice to use
QF as the control factor, it can be reasonably argued that QF
is not necessarily a good factor for this purpose because it has
different impacts on different images. One alternative is bits per
pixel (bpp), which is a more direct measure of compression
efficiency than QF. Now we will show what happens if
we switch from QF to bpp for the same benchmarking task
described in Sec. IV. To simplify the discussion, we focus on
the visual quality of recovered images only.

Switching from QF to bpp immediately raises a problem:
we can control QF directly to have a fixed set of values
for all digital watermarking schemes, but we cannot control
bpp directly as it is not an encoding parameter but a
post-compression metric. The fixed set of values is important
because we need to calculate average performance indicators,
which can be done if all performance indicators are aligned.

When the values of performance indicators are not aligned,
we need to find a way to average the results across all test
images. One approach is to fit a curve for each image covering
a continuous range of the control factor and then to average all
those curves produced for all test images. Let us show how this
can be done using PSNR as an example. The task here is to
get a continuous bpp-PSNR function for each test image based
on a finite number of (bpp, PSNR) points, and then average
the bpp-PSNR functions of all images to get an average
bpp-PSNR function. To this end, denote the bpp-PSNR function for
the $i$-th test image by $f_i(\cdot)$. Since we have no knowledge of
each individual function $f_i(\cdot)$, we simply connect all the (bpp,
PSNR) points to form a piecewise linear function. We also
limit the domain of $f_i(\cdot)$ to $[\min_i(\mathrm{bpp}), \max_i(\mathrm{bpp})]$, the range
between the minimum and maximum bpp values observed.

Figure 10 shows 100 bpp-PSNR functions estimated from
100 images for the digital watermarking scheme M1. As
expected, those functions do not have aligned domains since
the minimum and maximum bpp values vary from image to
image. In order to align all the functions, we extend all their
domains to $(-\infty, \infty)$ and assign $f_i(x) = 0$ when the bpp
value $x$ goes out of $[\min_i(\mathrm{bpp}), \max_i(\mathrm{bpp})]$.

Fig. 10: The bpp-PSNR functions of 100 images recovered by
the digital watermarking scheme M1.

After making the above preparation, the average bpp-PSNR
function $f(x)$ for all $N$ test images can be defined as follows:

$$f(x) = \frac{\sum_{i=1}^{N} f_i(x)}{\sum_{i=1}^{N} \mathrm{sign}(f_i(x))}, \tag{1}$$

where $\mathrm{sign}(a) = 0$ for $a \le 0$ and 1 otherwise, which is used
to count only bpp-PSNR functions covering $x$.
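
The averaging procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PE algorithms added to the MATLAB prototype: the names `bits_per_pixel`, `make_piecewise_linear` and `average_curve` are hypothetical, bpp is taken as the usual ratio of compressed size in bits to the number of pixels, and a curve is counted as covering a bpp value when its value there is positive (PSNR values inside the observed range are assumed to be positive).

```python
from bisect import bisect_right


def bits_per_pixel(compressed_bytes, width, height):
    """bpp measured after compression: total bits divided by pixel count."""
    return 8.0 * compressed_bytes / (width * height)


def make_piecewise_linear(points):
    """Build f_i from a finite list of (bpp, PSNR) points.

    The points are connected by straight segments; outside the observed
    range [min bpp, max bpp] the function is defined to be 0.
    """
    pts = sorted(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def f(x):
        if x < xs[0] or x > xs[-1]:
            return 0.0                      # extended domain: no coverage
        if x == xs[-1]:
            return ys[-1]
        k = bisect_right(xs, x) - 1         # segment with xs[k] <= x < xs[k+1]
        t = (x - xs[k]) / (xs[k + 1] - xs[k])
        return ys[k] + t * (ys[k + 1] - ys[k])

    return f


def average_curve(curves, x):
    """Average of all curves at x, counting only curves that cover x.

    Returns None when no curve covers x (the denominator would be 0).
    """
    values = [f(x) for f in curves]
    covering = sum(1 for v in values if v > 0)
    if covering == 0:
        return None
    return sum(values) / covering
```

For example, with two curves built from the points (0.5, 30), (1.5, 40) and (1.0, 20), (2.0, 30), only the first curve covers bpp = 0.75, so the average there equals that curve's value alone, whereas both curves contribute at bpp = 1.25.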

The above approach can be easily generalized to any IQA
metric. We added two new PE algorithms to the PE Library
and produced the performance comparison results for the
average recovered image quality w.r.t. JPEG compression, as
shown in Figure 11. Compared with the results shown in
Figures 8(a) and (d), we can see some clear differences in
the conclusion of the performance comparison: while M3
remains the best scheme as a whole, M2 now outperforms
M1 when the bpp value goes above a threshold. This example
demonstrates how strongly the final results depend on the
details of how the performance indicators are handled.
Fig. 11: Reproduction of the results in Figs. 8(a) and (d) by
replacing the control factor QF by bpp. The x-axis is the bpp
value and the y-axis is the PSNR/SSIM value.

VI. Conclusion and Future Work

In this paper, we present OR-Benchmark, an open and
highly reconfigurable general-purpose benchmarking framework,
to meet the needs of benchmarking different digital
watermarking schemes. To the best of our knowledge, this
is the first and the only benchmarking framework supporting
all known types of digital watermarking schemes including
complicated ones involving multiple types of watermarks. We
implemented a prototype as a MATLAB software package,
and give a case study on three image authentication and
self-restoration watermarking schemes to showcase the usefulness
of OR-Benchmark as a convenient and flexible tool.

Although OR-Benchmark as a general framework can easily
support any media types, attacks, test multimedia datasets, and
PE algorithms, our current implementation mainly has built-in
functional units for digital images. The Offline Analyzer is also
tailored towards our own needs for benchmarking some special
types of digital watermarking schemes. In the future, we plan to
add more functional units to the prototype so that users can
focus on the digital watermarking schemes themselves without
having to add many user-defined algorithms. We
also plan to release our MATLAB prototype under an open
source license and call for contributions from the whole digital
watermarking community. A dedicated website will be set up
to host related documents and the source code of our prototype
implementation.